Conversation

@ikawrakow (Owner) commented Nov 11, 2025

The DeepSeek self-attention mechanism is quite different from that of other models, so merging the "Q" and "K" model tensors is much trickier than it is for standard self-attention. But I was curious to see whether it could be done, and this PR shows that it is possible.

For DeepSeek-Lite fully offloaded, this gives a 1.5-2% improvement in TG performance.

I cannot test with the larger siblings (R1/V3/Kimi-K2), so I'm not sure I haven't broken something: there is one additional matrix multiplication involved, and it is easy to make a mistake with the views into the result of the merged matrix multiplication.

As with other Q/K/V merges, enabling this will disable mmap.

The option is disabled by default and is enabled with -mqkv.
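
For readers trying to follow what the merged projection does, here is a minimal, hypothetical sketch of the general pattern. It is not the PR's actual code; names such as w_qkv, n_q_out, and n_kv_out are made up for illustration. The idea is that the two projection weights are concatenated along the output dimension at load time, a single ggml_mul_mat then replaces the two separate multiplications, and the Q and KV parts are recovered as strided views into the result, which is exactly the step where the offsets are easy to get wrong.

#include "ggml.h"

// Hypothetical sketch of a merged projection split via views (illustrative names).
// w_qkv is assumed to hold the Q and KV projection weights concatenated along
// the output dimension at model load time: ne = [n_embd, n_q_out + n_kv_out].
static void build_merged_qk_example(struct ggml_context * ctx,
                                    struct ggml_tensor  * w_qkv,
                                    struct ggml_tensor  * cur,   // activations, ne = [n_embd, n_tokens]
                                    int64_t n_q_out, int64_t n_kv_out,
                                    struct ggml_tensor ** q_out,
                                    struct ggml_tensor ** kv_out) {
    const int64_t n_tokens = cur->ne[1];

    // One matrix multiplication instead of two, i.e. one kernel launch saved.
    // Each result row is laid out as [ Q part | KV part ].
    struct ggml_tensor * qkv = ggml_mul_mat(ctx, w_qkv, cur);  // ne = [n_q_out + n_kv_out, n_tokens]

    // Q part: the first n_q_out values of every row. The view keeps the original
    // row stride (qkv->nb[1]) so that it skips over the KV part of each row.
    *q_out = ggml_view_2d(ctx, qkv, n_q_out, n_tokens, qkv->nb[1], 0);

    // KV part: the remaining n_kv_out values, offset by n_q_out elements (in bytes).
    *kv_out = ggml_view_2d(ctx, qkv, n_kv_out, n_tokens, qkv->nb[1],
                           n_q_out*ggml_element_size(qkv));

    // Note: these views are not contiguous; depending on what consumes them,
    // a ggml_cont() may be required.
}

The actual DeepSeek graph is more involved, and which tensors get merged there is not shown in this thread; the sketch only illustrates the view-splitting step that the description above calls easy to get wrong.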

@Ph0rk0z commented Nov 11, 2025

Didn't see much of a boost or any negative side effects. Tested IQ2 V3.

@ikawrakow (Owner, Author)

@Ph0rk0z

Thanks for testing. Did you use -mqkv in your testing?

@calvin2021y

For short context and CPU-only, with Kimi-K2-Thinking-UD-Q4_K_XL I get 12.7 t/s without -mqkv and 12.6 t/s with -mqkv. Do I need to use -ctk q8_0?

@ikawrakow (Owner, Author)

Do I need to use -ctk q8_0?

When running CPU-only, -ctk q8_0 tends to improve performance, with the benefit increasing with context length.
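
For example (hypothetical model path and test sizes, not taken from this thread), a CPU-only llama-bench run with a quantized K-cache just adds -ctk q8_0, plus -mqkv 1 to exercise this PR's option:

./bin/llama-bench -m /path/to/model.gguf -t 48 -ctk q8_0 -mqkv 1 -p 512,2048,8192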

@calvin2021y

This PR should improve performance even without -ctk q8_0?

In my Zen4 CPU tests, -ctk q8_0 slows down t/s for short contexts.

I will try longer contexts, e.g. over 10K.

@ikawrakow (Owner, Author)

Yes, the change in performance should not depend on the KV cache type.

But I'm surprised your Zen4 CPU shows lower performance with a q8_0 K-cache. What is the CPU and how many threads are you using?

@calvin2021y

An EPYC 9454P, with --threads 48 --threads-batch 96.

@ikawrakow (Owner, Author)

Does --threads-batch 96 improve PP performance compared to --threads-batch 48?

@calvin2021y

Thanks for the tips. Testing this PR without -mqkv:

--threads 48 --threads-batch 96: 73.8 t/s PP, 11.3 t/s TG
--threads 48 --threads-batch 48: 79.0 t/s PP, 11.5 t/s TG

With -mqkv:

--threads 48 --threads-batch 96: 73.7 t/s PP, 11.3 t/s TG
--threads 48 --threads-batch 48: 78.8 t/s PP, 11.5 t/s TG

@Ph0rk0z commented Nov 12, 2025

Yep, I used -mqkv.

On the other topic, with my Xeons: I too see slightly higher PP in llama-bench when using the hyperthreads, but TG suffers and the PP gain is not consistent, so I have settled on just using 48 threads. Even if --numa distribute sometimes picks a hyperthread instead of a physical core, it usually leaves the physical sibling alone in that case. After the initial load, pinning with numactl respects the cores, but speeds are slightly lower or the same.

A lot of these tweaks are minor on their own, but I applied them all at once one day and gained a t/s or two. Individually they are often lost in the noise of the sweep bench.

BTW, llama-bench is segfaulting with deepseek for some reason.

CUDA_VISIBLE_DEVICES=0,1,2,3 numactl -C 0-47 --interleave=all ./bin/llama-bench \
    -m /DeepSeek-V3-0324-UD-IQ2_XXS-00001-of-00005.gguf \
    -t 48 \
    --numa distribute \
    -ngl 62 \
    -ctk q8_0 \
    -ctv q8_0 \
    -mmp 0 \
    -mla 3 \
    -ub 4096 \
    -b 4096 \
    -amb 1024 \
    -mqkv 1 \
    -cuda offload-batch-size=0,fusion=1 \
    -ot "blk\.(6|7|8)\.ffn_.*(exps).=CUDA0" \
    -ot "blk\.(9|10|11|12)\.ffn_.*(exps).=CUDA1" \
    -ot "blk\.(13|14|15|16)\.ffn_.*(exps).=CUDA2" \
    -ot "blk\.(17|18|19|20)\.ffn_.*(exps).=CUDA3" \
    -ot "ffn_.*_exps.=CPU" \
    -p 32,64,128,256,512,1024,2048

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance
| model                          |       size |     params | backend    | ngl | n_batch | n_ubatch | type_k | type_v | mla |   amb | mmap | mqkv |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -----: | -----: | --: | ----: | ---: | ---: | ------------: | ---------------: |
Segmentation fault (core dumped)

@ikawrakow (Owner, Author)

Yes, on the CPU it may not bring any benefit. It is mostly for inference with full GPU offload, where the cost of kernel launches is not negligible compared to the kernel processing time (i.e., for not-too-large models).

But at least it looks like I haven't broken the graph building, which is good news.

@ikawrakow (Owner, Author)

BTW, llama-bench is segfaulting with deepseek for some reason.

Can you run

CUDA_VISIBLE_DEVICES=0,1,2,3 numactl -C 0-47 --interleave=all gdb --args ./bin/llama-bench all_other_args here

and then type run when the gdb prompt comes up? When it crashes, type bt and post the output.

@Ph0rk0z commented Nov 12, 2025

| model                          |       size |     params | backend    | ngl | n_batch | n_ubatch | type_k | type_v | mla |   amb | mmap | mqkv |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -----: | -----: | --: | ----: | ---: | ---: | ------------: | ---------------: |
[New Thread 0x7fff9699c000 (LWP 97385)]
[New Thread 0x7fff9619b000 (LWP 97386)]
[New Thread 0x7fff7b3fd000 (LWP 97387)]
[New Thread 0x7fff7abfc000 (LWP 97388)]
[New Thread 0x7fff61fff000 (LWP 97389)]
[New Thread 0x7fff617fe000 (LWP 97390)]
[New Thread 0x7fff5adde000 (LWP 97391)]
[New Thread 0x7fff47fff000 (LWP 97392)]

Thread 1 "llama-bench" received signal SIGSEGV, Segmentation fault.
0x00007ffff7f3749b in llm_build_context::build_deepseek2() () from /home/supermicro/ai/ik_llama.cpp/src/libllama.so
(gdb) bt
#0  0x00007ffff7f3749b in llm_build_context::build_deepseek2() () from /home/supermicro/ai/ik_llama.cpp/src/libllama.so
#1  0x00007ffff7f40aa9 in llm_build_context::llama_build_graph(llama_context&, llama_batch const&, bool) ()
   from /home/supermicro/ai/ik_llama.cpp/src/libllama.so
#2  0x00007ffff7e6d583 in llama_decode () from /home/supermicro/ai/ik_llama.cpp/src/libllama.so
#3  0x000055555556cbc1 in test_prompt(llama_context*, int, int, int, int) [clone .constprop.0] ()
#4  0x00005555555656be in main ()
(gdb) 

@ikawrakow (Owner, Author)

Thanks!

Just to make sure: the crash is with this PR?

It crashes while building the graph. I'm not sure I understand why it works for @calvin2021y but crashes for you, and I understand even less why it crashes for you in llama-bench but not with llama-sweep-bench or llama-server.

@Ph0rk0z commented Nov 12, 2025

I noticed it after this PR, but I think it started a little earlier. I've been trying to build the chart from #910.

I did it successfully for GLM but not for DeepSeek.

@ikawrakow merged commit 9e2b21f into main on Nov 14, 2025